Covered:
Linear regression:
Simple linear regression models the relationship between a response variable (Y) and a predictor variable (X).
The sample regression equation is:
\[\hat{Y} = a + bX\]
Where: \(\hat{Y}\) is the predicted value of Y, \(a\) is the intercept, and \(b\) is the slope.
Method of Least Squares: The line is chosen to minimize the sum of squared vertical distances (residuals) between observed and predicted Y values.
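A small sketch of this idea on simulated data (all names are illustrative, not from the lion dataset): the least-squares coefficients minimize the sum of squared residuals, so any perturbation of them increases it.

```r
# Verify that the lm() fit minimizes the sum of squared residuals (SSR).
set.seed(42)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30, sd = 0.5)
fit <- lm(y ~ x)
ssr_fit <- sum(resid(fit)^2)

# Perturbing the fitted intercept should strictly increase the SSR.
a <- coef(fit)[1]; b <- coef(fit)[2]
ssr_perturbed <- sum((y - (a + 0.1) - b * x)^2)
ssr_fit < ssr_perturbed   # TRUE
```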
```
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5449 -1.1117 -0.5285  0.9635  4.3421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.8790     0.5688   1.545    0.133    
proportion_black  10.6471     1.5095   7.053 7.68e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared:  0.6238,    Adjusted R-squared:  0.6113 
F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08
```
The calculation for slope (b) is:
\[b = \frac{\sum_i(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i(X_i - \bar{X})^2}\]
Given:
\(\bar{X} = 0.3222\)
\(\bar{Y} = 4.3094\)
\(\sum_i(X_i - \bar{X})^2 = 1.2221\)
\(\sum_i(X_i - \bar{X})(Y_i - \bar{Y}) = 13.0123\)
b = 13.0123 / 1.2221 = 10.647
Intercept (a):
\(a = \bar{Y} - b\bar{X} = 4.3094 - 10.647(0.3222) = 0.879\)
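The hand calculation can be reproduced directly from the given summary quantities:

```r
# Reproduce the slope and intercept from the given sums.
x_bar <- 0.3222
y_bar <- 4.3094
ss_x  <- 1.2221   # sum of (X_i - x_bar)^2
sp_xy <- 13.0123  # sum of (X_i - x_bar)(Y_i - y_bar)

b <- sp_xy / ss_x        # slope
a <- y_bar - b * x_bar   # intercept
round(b, 3)  # 10.647
round(a, 3)  # 0.879
```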
Making predictions:
To predict the age of a lion with 0.50 proportion of black on its nose:
\[\hat{Y} = 0.88 + 10.65(0.50) = 6.2 \text{ years}\]
Confidence intervals vs. prediction intervals: a confidence interval bounds the mean response at a given X, while a prediction interval bounds a single new observation and is therefore wider. Both intervals are narrowest near \(\bar{X}\) and widen as X moves away from the mean.
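A minimal sketch of both intervals using predict() on simulated data (the lion data are not reproduced here; all names are illustrative):

```r
# Confidence vs. prediction intervals at a new X value.
set.seed(1)
d   <- data.frame(x = runif(50, 0, 1))
d$y <- 1 + 10 * d$x + rnorm(50, sd = 1.5)
fit <- lm(y ~ x, data = d)

new <- data.frame(x = 0.5)
ci  <- predict(fit, new, interval = "confidence")  # bounds the mean response
pri <- predict(fit, new, interval = "prediction")  # bounds a single new observation

# The prediction interval is always wider.
(pri[, "upr"] - pri[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```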
In addition to estimating the population parameters (intercept β0 and slope β1), we want to test hypotheses about them.
Total variation in Y is “partitioned” into components:
- \(SS_{total}\): sum of squared deviations of each observation (\(y_i\)) from the mean (\(\bar{y}\)); df = n - 1
- \(SS_{regression}\): sum of squared deviations of each fitted value (\(\hat{y}_i\)) from the mean (\(\bar{y}\)); df = 1 for a single predictor
- \(SS_{residual}\): sum of squared deviations of each observation from its fitted value; df = n - 2
The Sums of Squares and degrees of freedom are additive:
\(SS_{regression} +SS_{residual} = SS_{total}\)
\(df_{regression}+df_{residual} = df_{total}\)
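This additivity can be verified numerically; a minimal sketch on simulated data (names are illustrative):

```r
# Check the partition SS_total = SS_regression + SS_residual.
set.seed(7)
x <- runif(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)

ss_total      <- sum((y - mean(y))^2)
ss_residual   <- sum(resid(fit)^2)
ss_regression <- sum((fitted(fit) - mean(y))^2)

all.equal(ss_regression + ss_residual, ss_total)  # TRUE
```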
Sums of Squares are converted to Mean Squares by dividing by their degrees of freedom (MS = SS/df).
Regression typically tests the null hypothesis that β1 = 0.
This can be tested in two ways. First, a t-test on the slope, where \(\theta\) is the hypothesized value (here 0):
\[t=\frac{b_1-\theta}{s_{b_{1}}}\]
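Using the lion output (b1 = 10.6471, s_b1 = 1.5095) to reproduce the reported t value:

```r
# t statistic for the slope under H0: beta1 = 0.
t_stat <- (10.6471 - 0) / 1.5095
round(t_stat, 3)            # 7.053, as reported by summary()

# Two-sided p-value on the residual df (30); matches Pr(>|t|) in the output.
p_val <- 2 * pt(-t_stat, df = 30)
```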
Second, an F-test from the ANOVA table:
\[F = \frac {MS_{regression}}{MS_{residual}}\]
\[r^2 = \frac{SS_{regression}}{SS_{total}}=1-\frac{SS_{residual}}{SS_{total}}\]
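For simple linear regression the two tests are equivalent: the F statistic equals the square of the t statistic for the slope. Checking with the values from the lion output:

```r
# F = t^2 in simple regression; compare with F = 49.75 from summary().
t_stat <- 10.6471 / 1.5095
round(t_stat^2, 2)  # 49.75
```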
```
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5449 -1.1117 -0.5285  0.9635  4.3421 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.8790     0.5688   1.545    0.133    
proportion_black  10.6471     1.5095   7.053 7.68e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared:  0.6238,    Adjusted R-squared:  0.6113 
F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08
```
“Lion age (years) could be predicted from the proportion of black on the nose using the simple linear regression model age = 10.65 × proportion_black + 0.88. Regression analysis showed that the slope of the relationship was significantly (at α = 0.05) different from 0 (\(F_{1,30}\) = 49.75, p < 0.0001, R² = 0.62).”
Note that the output also reports an adjusted R². What is that? It accounts for the number of predictors in the model, penalizing the addition of variables that do not meaningfully improve the fit.
The formula for adjusted R² is:
\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\]
Where: n is the sample size and p is the number of predictors (here p = 1).
\(R^2\) measures the proportion of variance in the dependent variable (age_years) that is explained by the independent variable (proportion_black).
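As a quick check against the summary() output above (n = 32 lions, since the residual df is 30; p = 1 predictor):

```r
# Adjusted R^2 for the lion model.
r2 <- 0.6238; n <- 32; p <- 1
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(r2_adj, 4)  # 0.6113, matching the summary() output
```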
Linearity:
- Check: plot residuals vs. predicted values and look for curvature
- If violated: transform X and/or Y, or fit a curvilinear model

Normality:
- Check: Q-Q plot (or histogram) of the residuals
- If violated: transform Y, or use a robust or nonparametric method

Homogeneity of variance:
- Check: plot residuals vs. predicted values and look for a funnel shape
- If violated: transform Y, or use weighted least squares

Independence:
- Check: determine the correlation coefficient between adjacent residuals (e.g., a Durbin-Watson test)
- If violated: model the dependence (e.g., time-series or mixed-effects methods)

Fixed X (X measured without error):
- If violated: consider Model II regression

Outlier or influence:
- Check: residual plots, leverage, and Cook's distance
Residual plots (residuals vs. predicted \(\hat{y}\)) can be used to assess several of these assumptions at once: linearity, homogeneity of variance, and the presence of outliers.
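A minimal sketch of the two standard diagnostic plots in base R (simulated data; all names are illustrative):

```r
# Standard residual diagnostics for a fitted lm.
set.seed(3)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit),
     xlab = "Predicted y", ylab = "Residual",
     main = "Residuals vs. predicted")   # look for curvature or a funnel shape
abline(h = 0, lty = 2)

qqnorm(resid(fit)); qqline(resid(fit))   # assess normality of residuals
```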
Weighted least squares: gives observations with smaller variance more weight in the fit.
Robust regression: approaches less sensitive to outliers than OLS, including:
- LAD (least absolute deviations): minimizes the sum of absolute residuals rather than squared residuals
- M-estimators: iteratively downweight observations with large residuals
- Rank-based methods: “if all else fails”
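A sketch of one robust option, M-estimation via rlm() from the MASS package (bundled with R); the simulated data and the planted outlier are illustrative:

```r
library(MASS)  # provides rlm()

set.seed(9)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.3)
y[1] <- 15   # plant a gross outlier

ols <- lm(y ~ x)
rob <- rlm(y ~ x)  # M-estimation (Huber weights by default)

# The M-estimator assigns the outlier a small weight instead of
# letting it drag the fit, as it does under ordinary least squares.
rob$w[1]   # near zero
coef(ols); coef(rob)
```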
Fixed X is an assumption of regular (Model I) regression. What if X is random (the typical case)?
- If the goal is prediction (interpolation), then Model I is OK…
- If the goal is correct parameter and error estimates, you may need to use Model II.
Model II regression is an approach underused in ecology. The MA and RMA approaches are slightly different, as outlined below.
OLS (Model I): use when only Y is measured with error, or the goal is prediction.
MA (Major Axis): use when the errors in X and Y are approximately equal and the variables are on the same scale.
SMA (Standardized Major Axis): use when X and Y are measured in different units or on different scales.
RMA (Reduced Major Axis): use when the error structure is uncertain; a reasonable compromise.
When the correlation coefficient is less than 1, the OLS slope is attenuated toward zero relative to the Model II estimates: the Model II slopes lie between the OLS slope of Y on X and the inverse of the OLS slope of X on Y (in the output below, OLS 2.87 < SMA 3.87 < RMA 4.00 < MA 5.06). The differences are most pronounced when the correlation between X and Y is weak; as the correlation approaches 1, the methods converge.
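These relationships can be checked against the lmodel2 output below; for example, the SMA slope equals the OLS slope divided by r, and the MA slope falls below the inverse of the OLS slope of X on Y:

```r
# Values taken from the printed lmodel2 output.
b_ols <- 2.874473
r     <- 0.7427668

b_ols / r    # SMA slope; cf. 3.869953 in the results table
b_ols / r^2  # inverse of the OLS slope of X on Y (~5.21); MA (5.06) lies below it
```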
Here’s a simplified decision tree:
- Are X and Y measured with error? If No → use OLS (Model I)
- Are the errors in X and Y approximately equal? If Yes → use MA
- Are X and Y measured in different units/scales? If Yes → consider SMA
- Is the correlation between X and Y weak (< 0.7)? If Yes → method choice is critical; consider RMA
- Are you uncertain about the error structure? If Yes → RMA is a reasonable compromise
Remember that when the correlation between X and Y is very strong (r > 0.9), all methods will yield similar results, making the choice less critical. The differences between methods become more pronounced as the correlation weakens.
Finally, it’s often valuable to run multiple methods and compare the results. If they lead to different ecological or biological interpretations, this should be explicitly addressed in your discussion.
```
Call:
lm(formula = y ~ x, data = data_ols_m2)

Residuals:
   Min     1Q Median     3Q    Max 
-8.058 -3.498 -0.990  2.946 16.070 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1346     2.6584   1.179    0.241    
x             2.8745     0.2617  10.982   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.858 on 98 degrees of freedom
Multiple R-squared:  0.5517,    Adjusted R-squared:  0.5471 
F-statistic: 120.6 on 1 and 98 DF,  p-value: < 2.2e-16
```
```
Model II regression

Call: lmodel2(formula = y ~ x, data = data_ols_m2, range.y =
"relative", range.x = "relative", nperm = 99)

n = 100   r = 0.7427668   r-square = 0.5517025 
Parametric P-values:   2-tailed = 9.070988e-19    1-tailed = 4.535494e-19 
Angle between the two OLS regression lines = 8.317517 degrees

Permutation tests of OLS, MA, RMA slopes: 1-tailed, tail corresponding to sign
A permutation test of r is equivalent to a permutation test of the OLS slope
P-perm for SMA = NA because the SMA slope cannot be tested

Regression results
  Method  Intercept    Slope Angle (degrees) P-perm (1-tailed)
1    OLS   3.134600 2.874473        70.81773              0.01
2     MA -18.688419 5.059928        78.82062              0.01
3    SMA  -6.805843 3.869953        75.51163                NA
4    RMA  -8.131038 4.002664        75.97273              0.01

Confidence intervals
  Method 2.5%-Intercept 97.5%-Intercept 2.5%-Slope 97.5%-Slope
1    OLS      -2.140952        8.410152   2.355051    3.393894
2     MA     -29.748586      -10.906922   4.280654    6.167542
3    SMA     -12.339088       -1.965648   3.385234    4.424077
4    RMA     -16.211541       -1.554125   3.344023    4.811882

Eigenvalues: 54.08817 1.502864 

H statistic used for computing C.I. of MA: 0.001181286 
```

```
                2.5 %   97.5 %
(Intercept) -2.140952 8.410152
x            2.355051 3.393894
```
```
Call: sma(formula = y ~ x, data = data_ols_m2, method = "SMA") 

Fit using Standardized Major Axis 

------------------------------------------------------------
Coefficients:
             elevation    slope
estimate     -6.805843 3.869953
lower limit -12.094381 3.385234
upper limit  -1.517306 4.424077

H0 : variables uncorrelated
R-squared : 0.5517025 
P-value : < 2.22e-16 
```
```
Call: sma(formula = y ~ x, data = data_ols_m2, method = "MA") 

Fit using Major Axis 

------------------------------------------------------------
Coefficients:
             elevation    slope
estimate    -18.688419 5.059928
lower limit -27.905282 4.280260
upper limit  -9.471556 6.168337

H0 : variables uncorrelated
R-squared : 0.5517025 
P-value : < 2.22e-16 
```